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Online social networking sites are becoming more popular amongst Internet 
users. The Internet users spend some amount of time on popular social 
networking sites like Facebook, Twitter and LinkedIn etc. Online social 
networks are considered to be much useful tool to the society used by 
Internet lovers to communicate and transmit information. These social 


networking platforms are useful to share information, opinions and ideas, 


make new friends, and create new friend groups. Social networking sites 
Keywords: provide large amount of technical information to the users. This large amount 
of information in social networking sites attracts cyber criminals to misuse 
i : these sites information. These users create their own accounts and spread 
Malicious links Inerable:- information to -th . This. informati b 

; á vulnerable information to the genuine users. This information may be 
Online social network advertising some product, send some malicious links etc to disturb the 
Spammer natural users on social sites. Spammer detection is a major problem now days 
Twitter in social networking sites. Previous spam detection techniques use different 
set of features to classify spam and non spam users. In this paper we 
proposed a hybrid approach which uses content based and user based features 
for identification of spam on Twitter network. In this hybrid approach we 
used decision tree induction algorithm and Bayesian network algorithm to 
construct a classification model. We have analysed the proposed technique 
on twitter dataset. Our analysis shows that our proposed methodology is 
better than some other existing techniques. 
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1. INTRODUCTION 

Social networks are growing rapidly, with networks like Facebook and Twitter. These networks 
attract million users a month and they are becoming an important medium of communication [1]. Social 
media are primarily internet based tools for sharing and discussing information. In social media people can 
share comments, personal details, photos, videos and establish relationships with other users [2]. Increase in 
popularity of internet and social network sites, it is very easy to gather large amount of information. Due to 
this large amount of data present in social network sites it attracts malicious users [3]. The spammers use 
their own strategies to attract genuine users, spread misinformation and propaganda. Large scale of spam 
messages are produced by spammers and spread these messages into social networks. Spam messages 
mislead the users and occupy the band width. To detect these spam messages various spam detection 
classification algorithms are used like decision tree induction algorithm, KNN algorithm, Naive bayes 
algorithm, Neural network algorithm and SVM algorithms [4]. Spam detection methodologies are analyzed 
by various researchers. Various techniques and frameworks are developed for spam detection [5]. Various 
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spam detection methodologies used KNN and decision tree induction algorithms, but KNN algorithm is 
better than Decision tree induction algorithm for finding spam messages in social networks [6]. 

In this paper we proposed an integrated methodology to detect spam messages. In this work, the 
twitter dataset is used. Thereafter the dataset is pre-processed to obtain normalized set of features depending 
on which the activities of spammers were studied. The key features extracted were content based and user 
based features. After obtaining these features, features are combined to improve the spam detection accuracy. 
In order to improve detection of spam messages, a integrated proposed approach was devised which 
combines the advantages of the two classification algorithms ie Decision tree algorithm, Naive bayes 
algorithm. The improvement in spam detection is measured on the basis of accuracy parameters. The results 
obtained shows that integrated approach that combine algorithms outperforms other classification approaches 
in terms of overall accuracy. In section 2, several research works are described as related work. Proposed 
method is presented in section 3. Experimental analysis is presented in section 4. The paper is concluded in 
section 5 with future enhancements. 


2. RELATED WORK 

The issue of spam detection in social networking sites is very critical task. The researchers are 
interested to do their research work on these areas. Many researchers have concentrated to find efficient 
methods to identify spam. This section summarizes the major contribution of various researchers on spam 
detection in social networks. Benjamin Markines et. al [7] proposed a spam detection methodology with 
various supervised learning algorithms. These algorithms are evaluated on six different features such as 
TagSpam, TagBlur, DomFp, Numads, Plagiarism, ValidLinks. Kyumin Lee et. al [8] describes spam 
detection approach with honeypots and SVM machine learning algorithm. Xin Jin et. al [9] used GAD 
clustering algorithm to detect spammers in social networks. This approach deal with scalability and real time 
spam detection challenges. Xueying et. al [10] classify the social dataset messages into spam messages with 
using ELM algorithm. 

This classification model is developed with various features such as proportion spammer messages 
with original messages, messages containing URL’s and life time of account. Hongyu et. al [11] filter the 
spam messages over social network sites. Faraz Ahmed et. al [12] classify the spam profiles in online social 
networks with Markov clustering. In this approach a weighted graph is used. From this weighted graph find 
active friends, page likes and shared URLs features. Cheng Cao et. al [13] classify the spam with behavioural 
analysis of the users. To analyse the behaviour of network users use click based features and post based 
features. Saini Jacob Soman et. al [14] describes detecting malicious tweets in social network with user based 
features, location based features, content based features and text based features. 

With these features a classification model is proposed by SVM classifier and ELM classifier 
algorithm. Proposed ELM based spam detection approach performs better spam detection rate compare to 
SVM. Kaiyu Wang et. al [5] proposed a methodology to detect spam with combining of network features and 
textual features. With these features a spam detection model is constructed by SVM machine learning 
algorithm. They measured overall accuracy of model is increased up to 29%. Fabricio Benevenuto et. al [15] 
classify user profiles into spammers or non spammers based on content based features and user based 
features. To classify the users they have used support vector machine algorithm. Sajid Yousuf Bhat et. al [16] 
describes a methodology to detect spam users in social networks by ensemble learning methods. To train 
these learning algorithms facebook data is used. In this methodology network structure based features are 
used. Hailu Xu et. al [17] analyzed different features to detect spam in various social network sites such as 
facebook, twitter. 

Arushi Gupta et. al [18] propose a mechanism to detect spammers in twitter network. In their work 
used tweet level features, user level features, URL’s, spam word features. They have used integrated 
approach to develop a model with Naive Bayes classifier, clustering and decision trees. Xianghan Zheng et. 
al [19] studies a methodology to detect spammers in social network. They have used content based features 
and user based features along with SVM machine learning algorithm to construct a spammer detection 
model. Zahra Mashayekhi et. al [20] analyzed content based features and non content based features to detect 
spam in E-mail messages. They have combined decision tree algorithm and Neural Network algorithm to 
develop a model. They have implemented this model on Lingaspam data. Anjali Sharma et. al [4] analysed 
various spam detection techniques. 

They have studied various origin based spam detection techniques such as Blacklists filters, white 
lists filters, Realtime Blackhole list filters and content based spam detection techniques such as Rule based 
filters, Bayesian filters, Support vector machines and Artificial Neural Network algorithms. Saumya Goyal 
et. al [6] classify the spam messages in twitter network. They have used decision tree and KNN algorithm to 
construct a classification model. Chen Lin et. al [21] describes a spam detection procedure with Extreme 
Machine Learning (ELM) algorithm. They have analysed content based features, user based features and 
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social interactivity features. Prabhjot Kaur et. al [22] did the survey on various spam detection techniques. 
They have done the survey on various user based spam detection techniques, content based spam detection 
techniques, hybrid based techniques and relation based techniques. Bhagyashri Toke et. al [23] describes a 
spam detection methodology to detect spam in facebook dataset. 

They have used combined approach of Naive bayes algorithm and Rule based algorithm. Bhagyasri 
Toke et. al [24] analyse the integrated approach to detect spam in social network sites. Ala M et. al [25] 
describes a spam detection approach to detect spam in social networks. In this approach they have used 
various features like content features, user based features. They have used various feature selection methods 
such information gain, relief methods. Malik Mateen et. al [26] describes spam detection approach to detect 
spam messages in twitter data. Aziah khamis et. al[27] analysed various data mining algorithms to extract for 
islanding detection in power distribution. Alhamza et. al [28] studied various unsupervised classifiers to 
classify network flow. Sachin Kamley et. al [29] analysed various machine learning techniques such as 
Decision tree, Neural Network, Support Vector Machines, Genetic algorithms and Bayesian Networks for 
performance forecasting of Share Market. Sharvil Shah et. al. [30] studied various classifier algorithms for 
sentimental analysis in Twitter data. Our proposed approach used hybrid features i.e., combining of content 
based features, user based features and graph based features. 


3. PROPOSED METHOD 

Most of the previous research has been done in detecting spam messages and identifying spammer 
profiles. The previous research papers use different spam message detection and spammer profile detection 
methodologies. Every methodology uses its own dataset and features for data classification. Spam detection 
approaches used various kinds of features such as user based features, content based features and graph based 
features [19], [26]. Each feature set has its own advantages and disadvantages. Based on these features we 
proposed a methodology that uses combination of user based features and content based features. We use 
these features to construct a classification model that classify the messages into non spam and spam 
messages. In our approach we proposed an integrated classification model that uses decision tree algorithm 
and Naive bayes classifier. An overview of the complete process of spam detection is shown in diagram in 
Figure 1, each of process steps are explained in this section 
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Figure 1. Proposed spam detection approach 


A. Dataset description: To evaluate our proposed classification methodology use twitter dataset. There is no 
publicly available dataset due to twitter policy. So we had to build one such labelled dataset or use the 
dataset that are used in previous spam detection approaches. Most of the previous researchers have used 
their own dataset. The researchers extract their required dataset from twitter by using API. Here we used 
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manually collected dataset from twitter for our experimental purpose. The dataset consist of 100 users and 

900 tweets. 

B. Pre-processing: In this step, all continuous features were converted into discrete features. 

C. Feature Selection: From all the dataset features, process relevant features are selected. In this proposed 
methodology content based features and user based features are selected. These features are integrated to 
improve the spam detection accuracy. 

1. User Based Features: User based features are used to describe the behaviour of users in twitter [16]. 
These features are based on user relationships and properties of user accounts in twitter dataset. Generally 
in social media networks users can develop their own social networks with other users. In social network 
one user follows other users and allows other users to follow him. Spammers want to follow many 
profiles to spread misinformation to them, so they try to follow large number of users to spread 
misinformation. Generally we consider, the number of users following is more than number of users 
following him, such user account is considered as spam account. Here we are using different user based 
features to construct a model. User features are related to user accounts and the features are extracted 
from user accounts. The various user based features used in our approach are: 

a) Number of Followers: This feature specifies the number of other users in network follow your account 
tweets. Generally followers define the popularity of someone profile. Generally spammers have less 
popularity and have less number of followers. 

b) Number of Following: This feature specifies the number of other user accounts you follow. In twitter 
if you follow someone means you will see their tweets in your timeline. Twitter network knows to 
whom you follow and who is following you. 

c) Age of Account: This feature specifies when the account has been created. 

d) Follower to Following Ratio: This is the ratio of followers to following ratio in network for any user 
account. Generally ff ratio is less for normal users and this ratio is high for spammers. 


a I E Number of following 
TARD Number of followers 


e) Reputation: This is the ratio between number of followers to sum of following and followers 


Followers 


R tati = 
mena Followers + Following 


2. Content Based Features: These features are related to tweets posted by user. Generally normal users can’t 
post duplicate content but spammers post lot of duplicate tweets. Content based features are based on 
messages that users write. The content based features are important to detect spam messages. Spammers 
are malicious users, who spread large amount of misinformation to the network users [16], [17]. The 
misinformation contains advertisements about their product and malicious links. The various content 
based features are used in our approach are: 

a) Number of Tweets: Total number of tweets posted by user after creating his account. 
b) Hashtag Ratio: This is the ratio between the tweets containing hashtags to total tweets posted and 
those tweets containing unique hashtag. 


Duplicate HashTags 
Total Hashtags = —————————_. 
Unique Hashtags x Tweet Count 


c) URL’s Ratio: This is the ratio between duplicate URL’s to number of distinct URL’s in tweets and 
sum of tweets. 


Hash duplicate URLs 


Total URLs = Tt ____—___. 
ao i Hash Unique URLs x Tweet Count 


d) Mentions Ratio: Twitter account users are identified by @username. @username can be written 
anywhere in the tweet. Spammers misuse this feature to send spam messages to the genuine users in 
network. Generally the user messages contain large number of mention and reply tags then user is 
consider as spam user. 


Tweets Containing@ 
@Tweets = 


Total Number of Tweets 


e) Tweet Frequency: Generally the tweet frequency of spammers is greater than genuine twitter user. 
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f) Spam words: we use specific spam words and count their occurrence in tweets of users. The 
spammers use this spam words and spread misinformation to the users. 


D. Integrated approach: In our proposed approach we use supervised machine learning algorithms. These 
algorithms are first trained on the labelled data set to develop classification models. These models are 
applied on unlabelled data to predict which data as spam data and which data as non spam data. In our 
proposed approach we integrate decision tree induction algorithm and Naive Bayes classification 
algorithm to improve spam detection accuracy. In our methodology first decision tree algorithm classifies 
the dataset as spam or non spam. To improve the classification accuracy of spam detection, categorized 
spam records of decision tree is given as input to the Naive bayes algorithm. Naive bayes algorithm 
further classifies any misclassified messages into spam or non spam. In this way categorized non spam 
messages of decision tree are also given as input to the Naive bayes algorithm to classify the any 
misclassified messages. 

1. Decision Tree Induction: The decision tree is one of the known classification algo rithms used in machine 
learning to guide the decision making process [28]. Many researchers used [6], [20] this classification 
algorithm to detect spam messages. The decision tree has three types of nodes. The root node has no 
incoming edges and zero or more outgoing edges. An internal node has exactly one incoming edge and 
two or more outgoing edges. The leaf or terminal node has exactly one incoming edge and no outgoing 
edges. Decision tree induction algorithms must provide a method for expressing an attribute test condition 
for different types of attributes like binary, nominal, ordinal and continuous attributes. There are many 
measures that can be used to determine the best way to split the records. These measures are defined in 
terms of the class distribution the records before and after splitting. The measures developed for selecting 
the best split are often based on the degree of impurity of the child nodes. Different impurity measures are 


Entropy (t) = -LfZ9 p(i|t)logsp(ilt) 
Gini (t) = 1-Y=5 [pCi P 


Classification error (t) = 1- max[p(i|t)] 
l 


P (ilt) denote the fraction of records belonging to class i at a given node t. 
Different decision tree induction algorithms ID3, C4.5, CART, J48 


2. Naive Bayes Classifier- This is one of the best machine learning algorithm for spam 
classification [17], [26]. To classify the message as a spam or non spam can be generalized by probability 
theory. The spam messages contain the specific words. The relationship between the attribute set and 
class variable within dataset is non-deterministic. To resolve this problem Bayes theorem introduces a 
statistical principle for combining prior knowledge of the classes with new evidence gathered from given 
data. Let X and Y be a pair of random variables. The joint probability, P (X=x, Y=y), gives the 
probability that variable X will take on the value x and variable Y will take on the value y. The 
conditional probability is the probability that a random variable will take on a particular value given that 
the outcome of another random variable is known. The conditional probability p (X=x| Y=y), gives 
the probability that the variable Y will take on value y, given that the variable X is observed to have the 
value x. Based on joint and conditional probabilities. 


Bayes theorem, P(Y |X) = oO 


Y is the event that a given tweet belongs to a given class. X is the d dimensional feature vector 
corresponding to the tweet. The Naive bayes model makes the independence assumption that the 
attributes are all independent. 


POIL 
P(X) 
The demonstrator is same P(X) for spam and non spam classes. So we can discard for classification. 


P(Y|X) = 


P(Spam) [[, (Xi|Spam), and P(Non Spam)[]“, (Xi|NonSpam) 
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Classify the given messages with higher probability. The given training dataset was used to determine the 
conditional probabilities. 
4. EXPERIMENTS AND EVALUATION 
For classification task, we trained and tested integrated approach and compared with other machine 
learning algorithms. In order to evaluate the model, evaluation metrics (precision, recall and F measure) are 
used. The evaluation metrics are, Precision is the ratio of number of instances correctly classified to the total 
number of instances and is expressed as 


TP 


prectston = TP + FP 


Recall is the ratio of number of instances correctly classified to the total number of predicted instances and is 


expressed as 


TP 
TP+F 


Recall = 


F measure is the harmonic mean between precision and recall and is expressed as 


Precision * Recall 


F measur = 2 * = 
Precision + Recall 


Accuracy is the total number of correctly classified instances of both classes over the total number of all 
instances in the dataset. 


TP+T 


Accuracy = —————_—_—_ 
Y = TpyENGFPST 


We have evaluated the results using two most classifiers, i.e., Decision tree induction and Naive 
bayes classifiers. Figure 2 shows that results of decision tree and Naive bayes classifiers based on user based 
features. Naive bayes classifier has closer recall, precision and F-measure as compared to decision tree 
induction classifier. Figure 3 shows the results of classifiers based on content based features. Decision tree 
has the highest recall, precision and F-measure compare to Naive bayes algorithm. But Naive bayes 
algorithm has almost same precision value compared to decision tree. Figure 4 shows the precision, recall 
and F-measure of classifiers based on combining of user based features and content based features. Compare 
to individual performance of classifiers based on user based features and content based features hybrid based 
feature classifiers are outperformed. To improve the further performance of classification model we integrate 
the decision tree induction classifier and Naive bayes classifier. Figure 5 shows that performance of 
integrated model based on hybrid features. 

We have analysed our results with the earlier works [15], [17], [18], [26], [31]. The earlier works 
used user based features, content based features and hybrid features. We observe an amount of improvement 
in the results of our approach from the features used in [15]. The top 20 word features are used [17] for spam 
detection. Compared to earlier classifications, our approach has highest recall, precision and F measure. 


E Recall E Recall 


E Precision E Precision 


E F-Measure E F-Measure 


Decision: Iree Naive Bayes Decision Tree Naive Bayes 


Figure 2. Classification results using user based Figure 3. Classification results using content based 
features features 
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E Precision E Precision 
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Decision Naive Bayes Integrated 


Tree approach 


Figure 4. Classification results using hybrid features Figure 5. Classification results of proposed approach 


In Table 1 we describe the comparison of our integrated approach with other classification models. 


Compare to other classifications our hybrid features integrated model out performs with high recall, precision 
and F measure. 


Table 1. Comparison between Integrated Approach and Other Classifiers 


Classifier Features Recall Precision F measure 
Decision Tree User based 0.96 0.95 0.9560 
Naive bayes User based 0.96 0.95 0.9560 
Decision Tree Content based 0.92 0.91 0.9149 
Naive bayes Content based 0.62 0.88 0.7274 
Decision Tree Integrated features (user based, content based features) 0.96 0.95 0.9549 
Naive bayes Integrated features (user based, content based features) 0.75 0.90 0.8181 
Integrated Classifier Integrated features (user based, content based features) 0.97 0.96 0.9649 


( Decision Tree and Naive bayes ) 


5. 


CONCLUSION 
In this paper we used the hybrid set of features for detecting spam messages on social networking 


sites. We proposed an integrated spam detection technique to detect spam messages in twitter dataset. In this 
integrated approach we used decision tree induction classifier, Naive bayes classifier and hybrid features, 
such as combining of user based features and content based features. Our approach shows performance with 
high recall, precision and F-measure. In future we will extend our approach to other new set of features and 
also do our experiments on other social media networks with large amount of dataset. 
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